The DDD-GUI is used to perform differential expression (DE) gene and transcript, differental alternaive spliced (DAS) gene and differential transcript usage (DTU) transcript analysis. To use our pipeline in your work, please cite:
Required R packages: shiny, shinydashboard, shinyFiles, tximport, edgeR, plotly, RUVSeq, eulerr, gridExtra, grid, ComplexHeatmap, Gmisc, limma
The following datasets are required in the pipeline (Figure 1):
Note: the analysis will read information according to the column names in the spreadsheets. Please use the same column names as in examples.
Figure 1: User provided datasets.
Four folders will be created in the working directory to save the analysis results:
The whole pipeline has a number of steps and each step is related to several datasets. This table is the summary of datasets required for the analysis. To generate/upload the datasets, the following options are provided:
Figure 3: Data information for the analysis. The columns:
Figure 4: Generate gene and transcript expression.
Once the read counts and TPMs are generated, the data is pre-processed with steps: Merge sequencing replicates (if exist), Remove low expressed transcripts/genes, Estimate batch effects (if exist) and Data normalisation. In each step, quality control plots are generated to optimise the paramters for pre-processing (Figure 5).
Figure 5: Data pre-processing. The figures were generated based on the RNA-seq data from study of Arabidopsis in response to cold (Calixto et al. 2018)
The sequencing replicates are generated by sequencing the same tissue several times, which do not add too much information to the technical variation. Sequencing replicates from the same biological replicate are often added to increase the depth. In the panel of Figure 6A, the sequencing replicate information will be extracted from the “srep” column of the “samples.csv” file ( Figure 1B) to indicate whether to merge the sequencing replicates.
The low expressed transcripts are filtered based on count per million (CPM) reads: \[ CPM=\frac{X_i}{N}\times 10^6 \] where \(X_i\) is read count of transcript \(i\) and \(N\) is the library size of the sample.
Note: The sample number cut-off \(n\) and CPM cut-off \(m\) are determined by the mean-variance trend plot (Figure 6B and Figure 6C). Read counts are assumed to be negative binomial distributed of which the mean and variance has a relation: \[ Var(\log_2X)=\frac{1}{\mu}+\phi \] where \(X\) is read count, \(\mu\) is the mean and \(\phi\) is the overdispersion. The expression variance decreasing monotonically with the increasing of the mean. In real RNA-seq data, the low expressed transcripts confound with noise and the expression distribution is different to the expressed transcripts. In the mean-variance trend plot, the low expressed transcripts cause a drop trend in low expression region (Figure 5 and Figure 6C). Therefore, the cut-offs \(n\) and \(m\) to filter the low expression can be optimised until the drop trend disappeared in the meam-variance trend plot.
Figure 6: Merge sequencing replicates and filter low expressed genes/transcripts.
PCA is a mathematical method to reduce expression data dimensionality while remaining most of the data variance. The reduction is performed by projecting the data to directions or principal components(PCs) of highest data varablity. For example. Therefore, the data main variance is accessible by investigate a few number of PCs rather than thousands of variables. In omic data analysis, the first two to four principal componets are often used to visualise the similarities and differences of samples, thereby determing the groups of samples. In the PCA scatter plot, the samples from the same condition often stay close, but far from the samples of other conditions. It can be used as evidence for data quality checking.
PCA plot also can be used to identify whether the RNA-seq data contain batch effects, which are caused by biological replications in different laboratory conditions. Compared with random noise, batch effects can be distinguished owning to the systematic biases of signals across replicates. For example, Figure 7A shows the PCA plot (PC1 vs PC2) of a RNA-seq data with 3 time-points T2, T10 and T19. Each time-point has 3 biological replicates. It can be seen that the first bioloigcal replicate (brep1) stay in a seperate cluster to other two replicates, which indicates a clear batch effect of the data. In this pipeline, the RUVSeq R package (Risso et al. 2014) is used to estimate the batch effects. The RUVSeq algorithm generates batch effect terms which can be incorporated in the design matrix of linear regression model for DE and DAS analysis, i.e.
\[ \text{observed expression = baseline effects + batch effects + noise} \]
It also generates a modified expression matrix in which the batch effects has been removed from the data. To avoid data over-modification, this dataset is only used to make PCA plot, but not used for downstream DE and DAS analysis (Figure 7B).
In this panel (Figure 7), users can select and visualise different PCs based on transcript level or gene level exprssion, or the average expression of biological replicates. The scatter points can be grouped and coloured according to biological replicates or conditions. Ellipses or polygons can be added to the plots to highlight the grouped clusters. Three options, RUVr, RUVs and RUVg, in the RUVSeq R package are provided to estimate the batch effects (see the package documentation for details). Users can “Update PCA plot” to visualise whether the batch effects are removed properly. If no clear batch effects in the PCA plot, please skip this step.
Figure 7: PCA plot before (A) and after (B) remove batch effects.
To enable unbiased comparisons, read counts must be normalised on the basis of sequencing depth across samples. Normalisation methods Trimmed Mean of M-values (TMM), Relative Log Expression (RLE) and upper-quartile are provided to reduce the effects from the systematic technical biases across samples (M. D. Robinson and Oshlack 2010; Bullard et al. 2010). Boxplots are used to visualise the expression distribution across samples before and after normalisation.
Figure 8: Data normalisation.
Statistics:
DE gene/transcript analysis: A gene/transcript is identified as DE in a contrast group if \(L_2FC\) of expression \(\geq\) cut-off and with adjusted p-value < cut-off.
Two pipelines are provided for DE analysis: “limma-voom” and “edgeR”. The “edgeR” pipeline includes two methods: “glmQL” (Genewise Negative Binomial Generalized Linear Models with Quasi-likelihood Tests) and “glm” (Genewise Negative Binomial Generalized Linear Models). In limma pipeline, \(L_2FCs\) are calculated based count per million reads (CPMs) while in edgeR, \(L_2FCs\) are based on read counts. From several real RNA-seq data analyses, high consensus is achieved between “limma-voom” and “glmQL” (>90%) and they have more stringent controls of false discovery rate than the “glm” method.
DAS gene and DTU transcript analysis: To test differential expression, the gene level expression is compared to transcript level expression in the contrast groups. A gene is DAS in a contrast group if adjusted p-value < cut-off and at least one transcript of the gene with \(\Delta PS \geq\) cut-off. A transcript is DTU if adjusted p-value < cut-off and \(\Delta PS \geq\) cut-off.
Two pipelines for expression comparative analysis:
In the GUI, contrast groups in csv file (Figure 1D) can be uploaded in panel Figure 9A. DE, DAS and DTU analysis pipeline, \(\Delta PS\) and cut-offs of test statistics can be selected/generated in panels Figure 9B and C. Subset of \(\Delta PS\) can be visualised in panel Figure 9D.
Figure 9: DE and DAS analysis.
This page includes panels to show the DE, DAS and DTU analysis results (Figure 10-12):
Figure 10: Number of DE genes, DAS genes, DE transcripts and DTU transcripts, and bar plots of up-down regulation numbers.
Figure 11: Euler diagram of numbers between contrast groups (A) and numbers to compare DE vs DAS genes and DT vs DTU transcripts (B).
Figure 12: Flowchart to show numbers of DE and/or DAS genes and DE and/or DTU transcripts.
Users can make heatmap for DE genes, DAS genes, DE transcripts and DTU transcripts identified in the analysis (Figure 13A). Users can also upload a gene or transcript list in csv file to make heatmap for specific list (Figure 13B). To make the heatmap, the average TPMs of conditions for the targets are standardized into z-scores. The dist and hclust R functions are used to perform the hierarchical clustering analysis. The heamaps and the target list in each cluster of the heatmaps can be saved to local folder.
Figure 13: Heatmap panel (A) and example user provided gene list in csv file for heatmap (B).
Gene and transcript expression profiles (TPMs or read counts) and PS (percent spliced) can be visualised by typing a gene name (Figure 14A). Users can also generate a number of plots by provides a gene list (Figure 13B) and all the plots will be saved to “figure” folder in the working directory (Figure 14B).
Figure 14: Plot of expression profiles and PS across conditions
Users can generate DE and DAS gene list by click “Generate” button (Figure 15B). These gene lists can be uploaded to Gene Ontology (GO) analysis tools (e.g. DAVID and agriGO) to generate GO annotation. To visualise the annotation plot, a csv file need to be uploaded to the GUI (Figure 15B). The file includes a “Category” column of CC (cellular component), BP (biological process) and MF (molecular function), a column of “Term” of GO annotation and columns of statistics to report the annotation enrichment, e.g. count, FDR, -log10(FDR), etc. (Figure 15A). In the panel, users can select different statistics to visulise. The selected gene list type will present in the plot title to distinguish whether the provided gene list is DE or DAS genes.
Figure 15: GO annotation plot
The table in this page lists the location of save files and the parameters for the analysis. If the information is wroing, users can double click the cells in the table to correct to values. Based on this information, report in three format, word, pdf and html, will be generated and saved in the “report” folder of working directory.
Figure 16: Generate report
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] Gmisc_1.6.4 htmlTable_1.12
## [3] Rcpp_0.12.18 ComplexHeatmap_1.18.1
## [5] gridExtra_2.3 eulerr_4.1.0
## [7] RUVSeq_1.14.0 EDASeq_2.14.1
## [9] ShortRead_1.38.0 GenomicAlignments_1.16.0
## [11] SummarizedExperiment_1.10.1 DelayedArray_0.6.6
## [13] matrixStats_0.54.0 Rsamtools_1.32.3
## [15] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
## [17] Biostrings_2.48.0 XVector_0.20.0
## [19] IRanges_2.14.11 S4Vectors_0.18.3
## [21] BiocParallel_1.14.2 Biobase_2.40.0
## [23] BiocGenerics_0.26.0 plotly_4.8.0.9000
## [25] ggplot2_3.0.0 edgeR_3.22.3
## [27] limma_3.36.3 tximport_1.9.12
## [29] shinyFiles_0.7.1 shinydashboard_0.7.0
## [31] shiny_1.1.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.3-2 rjson_0.2.20 hwriter_1.3.2
## [4] rprojroot_1.3-2 circlize_0.4.4 base64enc_0.1-3
## [7] GlobalOptions_0.1.0 fs_1.2.6 rstudioapi_0.7
## [10] bit64_0.9-7 AnnotationDbi_1.42.1 splines_3.5.1
## [13] R.methodsS3_1.7.1 DESeq_1.32.0 geneplotter_1.58.0
## [16] knitr_1.20 Formula_1.2-3 jsonlite_1.5
## [19] annotate_1.58.0 cluster_2.0.7-1 R.oo_1.22.0
## [22] compiler_3.5.1 httr_1.3.1 backports_1.1.2
## [25] assertthat_0.2.0 Matrix_1.2-14 lazyeval_0.2.1
## [28] later_0.7.4 acepack_1.4.1 htmltools_0.3.6
## [31] prettyunits_1.0.2 tools_3.5.1 bindrcpp_0.2.2
## [34] gtable_0.2.0 glue_1.3.0 GenomeInfoDbData_1.1.0
## [37] dplyr_0.7.6 rtracklayer_1.40.6 stringr_1.3.1
## [40] mime_0.5 XML_3.98-1.16 zlibbioc_1.26.0
## [43] MASS_7.3-50 scales_1.0.0 aroma.light_3.10.0
## [46] hms_0.4.2 promises_1.0.1 RColorBrewer_1.1-2
## [49] yaml_2.2.0 memoise_1.1.0 rpart_4.1-13
## [52] biomaRt_2.36.1 latticeExtra_0.6-28 stringi_1.1.7
## [55] RSQLite_2.1.1 genefilter_1.62.0 checkmate_1.8.5
## [58] GenomicFeatures_1.32.2 shape_1.4.4 rlang_0.2.2
## [61] pkgconfig_2.0.2 bitops_1.0-6 evaluate_0.11
## [64] lattice_0.20-35 purrr_0.2.5 bindr_0.1.1
## [67] htmlwidgets_1.2 bit_1.1-14 tidyselect_0.2.4
## [70] plyr_1.8.4 magrittr_1.5 R6_2.2.2
## [73] Hmisc_4.1-1 DBI_1.0.0 foreign_0.8-71
## [76] pillar_1.3.0 withr_2.1.2 nnet_7.3-12
## [79] forestplot_1.7.2 abind_1.4-5 survival_2.42-6
## [82] RCurl_1.95-4.11 tibble_1.4.2 crayon_1.3.4
## [85] rmarkdown_1.10 GetoptLong_0.1.7 progress_1.2.0
## [88] locfit_1.5-9.1 data.table_1.11.6 blob_1.1.1
## [91] digest_0.6.17 xtable_1.8-3 tidyr_0.8.1
## [94] httpuv_1.4.5 R.utils_2.7.0 munsell_0.5.0
## [97] viridisLite_0.3.0
Bullard, James H, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. 2010. “Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.” BMC Bioinformatics 11 (1): 94. doi:10.1186/1471-2105-11-94.
Calixto, Cristiane P G, Wenbin Guo, Allan B James, Nikoleta A Tzioutziou, Juan C Entizne, Paige E Panter, Heather Knight, Hugh Nimmo, Runxuan Zhang, and John W.S. Brown. 2018. “Rapid and Dynamic Alternative Splicing Impacts the Arabidopsis Cold Response Transcriptome.” The Plant Cell. American Society of Plant Biologists. doi:10.1105/tpc.18.00177.
Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon provides fast and bias-aware quantification of transcript expression.” Nature Methods 14 (4): 417–19. doi:10.1038/nmeth.4197.
Risso, Davide, John Ngai, Terence P. Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-seq data using factor analysis of control genes or samples.” Nature Biotechnology 32 (9): 896–902. doi:10.1038/nbt.2931.
Robinson, Mark D., and Alicia Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biology 11 (3): R25. doi:10.1186/gb-2010-11-3-r25.
Soneson, Charlotte, Michael I Love, and Mark D Robinson. 2016. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” Journal Article. F1000Research 4: 1521. doi:10.12688/f1000research.7563.2.